Online Planning


Adaptive Online Packing-guided Search for POMDPs

Neural Information Processing Systems

The partially observable Markov decision process (POMDP) provides a general framework for modeling an agent's decision process with state uncertainty, and online planning plays a pivotal role in solving it. A belief, a distribution over states, represents the agent's state uncertainty. Methods for large-scale POMDP problems rely on the same underlying idea of sampling both states and observations.
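
As a rough illustration of that shared sampling idea (not this paper's algorithm), the sketch below represents a belief as a set of state particles and updates it by sampling transitions and reweighting by observation likelihood. The dynamics and observation model here are invented toy placeholders.

```python
import random

def transition(state, action):
    # Toy stochastic dynamics (hypothetical): noisy random walk on integers.
    return state + action + random.choice([-1, 0, 1])

def observation_prob(obs, state):
    # Toy observation model (hypothetical): observe the state with +/-1 noise.
    return {0: 0.6, 1: 0.2, -1: 0.2}.get(obs - state, 0.0)

def update_belief(particles, action, obs, n_particles=100):
    """One sampling-based belief update: propagate each state particle through
    the transition model, then reweight and resample by observation likelihood."""
    propagated = [transition(s, action) for s in particles]
    weights = [observation_prob(obs, s) for s in propagated]
    total = sum(weights)
    if total == 0:  # observation inconsistent with every particle: keep prediction
        return propagated
    return random.choices(propagated, weights=weights, k=n_particles)

belief = [0] * 100                         # initial belief: all mass on state 0
belief = update_belief(belief, action=1, obs=2)
print(f"mean of belief after one update: {sum(belief) / len(belief):.2f}")
```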



Online Planning with Lookahead Policies

Neural Information Processing Systems

Real Time Dynamic Programming (RTDP) is an online algorithm based on Dynamic Programming (DP) that acts by 1-step greedy planning. Unlike DP, RTDP does not require access to the entire state space, i.e., it explicitly handles exploration. This makes RTDP particularly appealing when the state space is large and it is not possible to update all states simultaneously. In this work, we devise a multi-step greedy RTDP algorithm, which we call $h$-RTDP, that replaces the 1-step greedy policy with an $h$-step lookahead policy. We analyze $h$-RTDP in its exact form and establish that increasing the lookahead horizon, $h$, results in improved sample complexity, at the cost of additional computation. This is the first work that proves improved sample complexity as a result of increasing the lookahead horizon in online planning. We then analyze the performance of $h$-RTDP in three approximate settings: approximate model, approximate value updates, and approximate state representation. For these cases, we prove that the asymptotic performance of $h$-RTDP remains the same as that of a corresponding approximate DP algorithm, the best one can hope for without further assumptions on the approximation errors.
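
To make the $h$-step lookahead idea concrete, here is a minimal toy sketch (our illustration under simplifying assumptions, not the paper's exact $h$-RTDP or its analysis): on a small known MDP, the agent acts greedily with respect to an $h$-step lookahead and performs a Bellman backup only at the state it actually visits.

```python
import random

# Toy 3-state MDP (hypothetical): P[s][a] = [(prob, next_state), ...]; R[s][a] = reward.
P = [
    [[(1.0, 1)], [(0.5, 0), (0.5, 2)]],   # state 0
    [[(1.0, 2)], [(1.0, 0)]],             # state 1
    [[(1.0, 2)], [(1.0, 2)]],             # state 2 (absorbing, rewarding)
]
R = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]

def lookahead(V, s, h, gamma=0.95):
    """Best h-step lookahead value and first action from state s."""
    if h == 0:
        return V[s], None
    best_q, best_a = float("-inf"), None
    for a in range(len(P[s])):
        q = R[s][a] + gamma * sum(p * lookahead(V, s2, h - 1, gamma)[0]
                                  for p, s2 in P[s][a])
        if q > best_q:
            best_q, best_a = q, a
    return best_q, best_a

def h_rtdp_episode(V, s, h, steps=20):
    """Act h-step greedily online; back up only the states actually visited."""
    for _ in range(steps):
        v, a = lookahead(V, s, h)
        V[s] = v                            # Bellman backup at the visited state
        probs, succs = zip(*P[s][a])
        s = random.choices(succs, weights=probs)[0]

V = [0.0, 0.0, 0.0]
h_rtdp_episode(V, s=0, h=2)
print(V)
```

The trade-off the abstract describes is visible in the sketch: a larger $h$ makes each decision more expensive (the lookahead tree grows exponentially in $h$) while requiring fewer online interactions.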



DyRo-MCTS: A Robust Monte Carlo Tree Search Approach to Dynamic Job Shop Scheduling

Chen, Ruiqi, Mei, Yi, Zhang, Fangfang, Zhang, Mengjie

arXiv.org Artificial Intelligence

Dynamic job shop scheduling, a fundamental combinatorial optimisation problem in various industrial sectors, poses substantial challenges for effective scheduling due to frequent disruptions caused by the arrival of new jobs. State-of-the-art methods employ machine learning to learn scheduling policies offline, enabling rapid responses to dynamic events. However, these offline policies are often imperfect, necessitating the use of planning techniques such as Monte Carlo Tree Search (MCTS) to improve performance at online decision time. The unpredictability of new job arrivals complicates online planning, as decisions based on incomplete problem information are vulnerable to disturbances. To address this issue, we propose the Dynamic Robust MCTS (DyRo-MCTS) approach, which integrates action robustness estimation into MCTS. DyRo-MCTS guides the production environment toward states that not only yield good scheduling outcomes but are also easily adaptable to future job arrivals. Extensive experiments show that DyRo-MCTS significantly improves the performance of offline-learned policies with negligible additional online planning time. Moreover, DyRo-MCTS consistently outperforms vanilla MCTS across various scheduling scenarios. Further analysis reveals that its ability to make robust scheduling decisions leads to long-term, sustainable performance gains under disturbances.
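
Integrating a robustness estimate into tree search could look roughly like the following sketch: standard UCT selection with an additive per-action robustness bonus. This is a hypothetical illustration; the bonus term, its weight `c_rob`, and the `Child` statistics are our assumptions, not DyRo-MCTS's actual formulation.

```python
import math
from dataclasses import dataclass

@dataclass
class Child:
    visits: int = 0
    value_sum: float = 0.0
    robustness: float = 0.0  # assumed in [0, 1]: tolerance of reached states to new job arrivals

def select_action(children, c_uct=1.4, c_rob=0.5):
    """UCT action selection with an additive robustness bonus.
    `children` maps each candidate scheduling action to its Child statistics."""
    log_n = math.log(sum(ch.visits for ch in children.values()) + 1)

    def score(ch):
        exploit = ch.value_sum / max(ch.visits, 1)
        explore = c_uct * math.sqrt(log_n / max(ch.visits, 1))
        return exploit + explore + c_rob * ch.robustness

    return max(children, key=lambda a: score(children[a]))

# Hypothetical usage: two candidate job dispatches with different robustness.
children = {"dispatch_j1": Child(10, 7.0, 0.9), "dispatch_j2": Child(12, 8.0, 0.2)}
print(select_action(children))  # favors the more robust dispatch here
```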



Online Planning for Cooperative Air-Ground Robot Systems with Unknown Fuel Requirements

Agarwal, Ritvik, Hatami, Behnoushsadat, Gautam, Alvika, Maini, Parikshit

arXiv.org Artificial Intelligence

We consider an online variant of the fuel-constrained UAV routing problem with a ground-based mobile refueling station (FCURP-MRS), in which targets incur unknown fuel costs. We develop a two-phase solution: an offline heuristic-based planner that computes initial paths for the UAV and the unmanned ground vehicle (UGV), and a novel online planning algorithm that dynamically adjusts rendezvous points based on real-time fuel consumption during target processing. Preliminary Gazebo simulations demonstrate the feasibility of our approach in maintaining UAV-UGV path validity and ensuring mission completion. Link to video: https://youtu.be/EmpVj-fjqNY
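
A minimal sketch of what such an online adjustment loop might look like (a hypothetical structure supplied for illustration, not the paper's algorithm; straight-line distances and a constant fuel burn rate are assumptions):

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def adjust_rendezvous(uav_pos, fuel_left, rendezvous, ugv_path, burn_rate=1.0):
    """If the UAV can no longer reach the planned rendezvous on its remaining
    fuel, pull the rendezvous back to the reachable point on the UGV path
    closest to the original plan."""
    if fuel_left >= burn_rate * dist(uav_pos, rendezvous):
        return rendezvous                                # plan still valid
    reachable = [p for p in ugv_path
                 if burn_rate * dist(uav_pos, p) <= fuel_left]
    if not reachable:
        raise RuntimeError("no reachable rendezvous: mission infeasible")
    return min(reachable, key=lambda p: dist(p, rendezvous))

# Hypothetical usage: only 5 fuel units remain, original rendezvous at (8, 0).
print(adjust_rendezvous(uav_pos=(0, 0), fuel_left=5.0,
                        rendezvous=(8, 0), ugv_path=[(2, 0), (4, 0), (8, 0)]))
# -> (4, 0): the closest reachable point to the original rendezvous
```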


Review for NeurIPS paper: Online Planning with Lookahead Policies

Neural Information Processing Systems

Additional Feedback: COMMENTS AFTER REBUTTAL Thank you for your response. However, in this paper's case I find that the significance of the paper (i.e., support for your claim that "theoretical results provided in this work are important on their own") is severely lacking without experiments showing a link between this theory and an algorithm's performance in terms of measures like running time, number of 1-step Bellman backups, etc. ***Note: this is not a claim that every theoretical paper needs experiments; it applies only to this specific work, due to the theory issues mentioned in the original review.*** The rebuttal's attempted arguments against providing experiments miss the mark: -- The rebuttal cites "Beyond the one step greedy approach in RL" as an example of a paper similar to this submission in its degree of theoretical focus, but that paper actually has experiments! Light experiments could do the job; the very paper you cited is a case in point.


Autonomous Tail-Sitter Flights in Unknown Environments

Lu, Guozheng, Ren, Yunfan, Zhu, Fangcheng, Li, Haotian, Xue, Ruize, Cai, Yixi, Lyu, Ximin, Zhang, Fu

arXiv.org Artificial Intelligence

Trajectory generation for fully autonomous flights of tail-sitter unmanned aerial vehicles (UAVs) presents substantial challenges due to their highly nonlinear aerodynamics. In this paper, we introduce, to the best of our knowledge, the world's first fully autonomous tail-sitter UAV capable of high-speed navigation in unknown, cluttered environments. The UAV autonomy is enabled by cutting-edge technologies including LiDAR-based sensing and differential-flatness-based trajectory planning and control, with purely onboard computation. In particular, we propose an optimization-based tail-sitter trajectory planning framework that generates high-speed, collision-free, and dynamically feasible trajectories. To efficiently and reliably solve this nonlinear, constrained problem, we develop an efficient feasibility-assured solver, EFOPT, tailored for the online planning of tail-sitter UAVs. We conduct extensive simulation studies to benchmark EFOPT against conventional NLP solvers on planning tasks. We also demonstrate extensive experiments of aggressive autonomous flights with speeds up to 15 m/s in various real-world environments, including indoor laboratories, underground parking lots, and outdoor parks. A video demonstration is available at https://youtu.be/OvqhlB2h3k8, and the EFOPT solver is open-sourced at https://github.com/hku-mars/EFOPT.
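
EFOPT itself is open-sourced at the repository above; purely as a generic illustration of the feasibility-assurance idea (a penalty-continuation loop, which is our assumption and not necessarily EFOPT's method), a solver wrapper could look like this:

```python
import numpy as np
from scipy.optimize import minimize

def solve_with_feasibility(objective, violations, x0,
                           mu0=1.0, growth=10.0, tol=1e-4, max_rounds=10):
    """Minimize objective(x) subject to violations(x) <= 0 elementwise, by
    re-solving with an increasing quadratic penalty until feasible."""
    x, mu = np.asarray(x0, dtype=float), mu0
    for _ in range(max_rounds):
        def penalized(z):
            v = np.maximum(violations(z), 0.0)
            return objective(z) + mu * np.sum(v ** 2)
        x = minimize(penalized, x, method="L-BFGS-B").x
        if np.all(violations(x) <= tol):                 # feasible: done
            return x
        mu *= growth                                     # tighten the penalty
    raise RuntimeError("failed to reach feasibility")

# Toy usage: minimize (x - 3)^2 subject to x <= 1.
x_star = solve_with_feasibility(lambda z: (z[0] - 3.0) ** 2,
                                lambda z: np.array([z[0] - 1.0]),
                                x0=[0.0])
print(x_star)   # approaches [1.0]
```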


Overcoming Slow Decision Frequencies in Continuous Control: Model-Based Sequence Reinforcement Learning for Model-Free Control

Patel, Devdhar, Siegelmann, Hava

arXiv.org Artificial Intelligence

Reinforcement learning (RL) is rapidly reaching and surpassing human-level control capabilities. However, state-of-the-art RL algorithms often require timesteps and reaction times significantly faster than human capabilities, which is difficult to achieve in the real world and typically necessitates specialized hardware. We introduce Sequence Reinforcement Learning (SRL), an RL algorithm designed to produce a sequence of actions for a given input state, enabling effective control at lower decision frequencies. SRL addresses the challenges of learning action sequences by employing both a model and an actor-critic architecture operating at different temporal scales. We propose a "temporal recall" mechanism, in which the critic uses the model to estimate intermediate states between primitive actions, providing a learning signal for each individual action within the sequence. Once training is complete, the actor can generate action sequences independently of the model, achieving model-free control at a slower frequency. We evaluate SRL on a suite of continuous control tasks, demonstrating that it achieves performance comparable to state-of-the-art algorithms while significantly reducing actor sample complexity. To better assess performance across varying decision frequencies, we introduce the Frequency-Averaged Score (FAS) metric. Our results show that SRL significantly outperforms traditional RL algorithms in terms of FAS, making it particularly suitable for applications requiring variable decision frequencies. Additionally, we compare SRL with model-based online planning, showing that SRL achieves superior FAS while leveraging the same model during training that online planners use for planning.
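
A toy sketch of the "temporal recall" idea (the shapes, the linear actor and model, and all parameters below are hypothetical placeholders, not the paper's networks): the actor maps one observed state to a whole action sequence, and a learned one-step model rolls forward to recover the unseen intermediate states so a critic could score each action individually.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACT_DIM, SEQ_LEN = 4, 2, 3

# Linear stand-ins for the actor and the dynamics model (hypothetical).
W_actor = 0.1 * rng.normal(size=(SEQ_LEN * ACT_DIM, STATE_DIM))
W_model = 0.1 * rng.normal(size=(STATE_DIM, STATE_DIM + ACT_DIM))

def actor(state):
    """One state in, a whole sequence of actions out (model-free at test time)."""
    return (W_actor @ state).reshape(SEQ_LEN, ACT_DIM)

def model(state, action):
    """Learned one-step dynamics, used only during training."""
    return W_model @ np.concatenate([state, action])

def temporal_recall(state, actions):
    """Roll the model forward to estimate intermediate states, pairing a state
    with every action so the critic gets a per-action learning signal."""
    states = [state]
    for a in actions:
        states.append(model(states[-1], a))
    return states[:-1]

s0 = rng.normal(size=STATE_DIM)
print([np.round(s, 2) for s in temporal_recall(s0, actor(s0))])
```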